Abstract

The purpose of sufficient dimension reduc- 
tion (SDR) is to find the low-dimensional 
subspace of input features that is sufficient 
for predicting output values. In this pa- 
per, we propose a novel distribution-free SDR 
method called sufficient component analysis 
(SCA), which is computationally more effi- 
cient than existing methods. In our method, 
a solution is computed by iteratively per- 
forming dependence estimation and maxi- 
mization: Dependence estimation is analyt- 
ically carried out by recently-proposed least- 
squares mutual information (LSMI), and de- 
pendence maximization is also analytically 
carried out by utilizing the Epanechnikov ker- 
nel. Through large-scale experiments on real- 
world image classification and audio tagging 
problems, the proposed method is shown to 
compare favorably with existing dimension 
reduction approaches. 



1. Introduction 

The goal of sufficient dimension reduction (SDR) is to 
learn a transformation matrix W from input feature x 
to its low-dimensional representation z (= Wx) which 
has 'sufficient' information for predicting output value 
y. SDR can be formulated as the problem of finding z 
such that x and y are conditionally independent given 
z (Cook, 1998; Fukumizu et al., 2009).

Earlier SDR methods developed in the statistics community, such as sliced inverse regression (Li, 1991), principal Hessian direction (Li, 1992), and sliced average variance estimation (Cook, 2000), rely on the elliptical assumption (e.g., Gaussian) of the data, which may not be fulfilled in practice.

To overcome the limitations of these approaches, kernel dimension reduction (KDR) was proposed (Fukumizu et al., 2009). KDR employs a kernel-based dependence measure, which does not require the elliptical assumption (i.e., it is distribution-free), and the solution W is computed by a gradient method. Although KDR is a highly flexible SDR method, its critical weakness is the kernel function choice: the performance of KDR depends on the choice of kernel functions and the regularization parameter, but there is no systematic model selection method available. Furthermore, KDR scales poorly to massive datasets since the gradient-based optimization is computationally demanding. Another important limitation of KDR in practice is that there is no good way to set an initial solution: many random restarts may be needed for finding a good local optimum, which makes the entire procedure even slower and the performance of dimension reduction unstable.

To overcome the limitations of KDR, a novel 
SDR method called least-squares dimension reduction 
(LSDR) was proposed recently (Suzuki & Sugiyama, 
2010). LSDR adopts a squared-loss variant of mutual 
information as a dependency measure, which is effi- 
ciently estimated by least-squares mutual information 
(LSMI) (Suzuki et al., 2009). A notable advantage of LSDR over KDR is that kernel functions and their tuning parameters, such as the kernel width and the regularization parameter, can be naturally optimized by cross-validation. However, LSDR still relies on a computationally expensive gradient method and there is no good initialization scheme.

In this paper, we propose a novel SDR method called 
sufficient component analysis (SCA), which can over- 
come the computational inefficiency of LSDR. In SCA, 
the solution W in each iteration is obtained analyti- 
cally by just solving an eigenvalue problem, which sig- 
nificantly contributes to improving the computational 
efficiency. Moreover, based on the above analytic-form solution, we develop a method to design a good initial value for optimization, which further reduces the computational cost and helps obtain a good local optimum solution.

Through large-scale experiments using the PAS- 
CAL Visual Object Classes (VOC) 2010 dataset 
(Everingham et al., 2010) and the Freesound dataset 
(The Freesound Project, 2011), we demonstrate the 
usefulness of the proposed method. 

2. Sufficient Dimension Reduction with 
Squared-Loss Mutual Information 

In this section, we formulate the problem of sufficient 
dimension reduction (SDR) based on squared-loss mu- 
tual information (SMI). 

2.1. Problem Formulation 

Let $\mathcal{X}\ (\subset \mathbb{R}^d)$ be the domain of input feature $x$ and $\mathcal{Y}$ be the domain of output data$^1$ $y$. Suppose we are given $n$ independent and identically distributed (i.i.d.) paired samples,

$$D^n = \{(x_i, y_i) \mid x_i \in \mathcal{X},\ y_i \in \mathcal{Y},\ i = 1, \ldots, n\},$$

drawn from a joint distribution with density $p_{xy}(x, y)$.

The goal of SDR is to find a low-dimensional representation $z\ (\in \mathbb{R}^m,\ m < d)$ of input $x$ that is sufficient to describe output $y$. More precisely, we find $z$ such that

$$y \perp\!\!\!\perp x \mid z, \qquad (1)$$

meaning that, given the projected feature $z$, the feature $x$ is conditionally independent of output $y$.

In this paper, we focus on linear dimension reduction 
scenarios: 

z = Wx, 

where $W$ is a transformation matrix. $W$ belongs to the Stiefel manifold $\mathbb{S}_m^d(\mathbb{R})$:

$$\mathbb{S}_m^d(\mathbb{R}) := \{W \in \mathbb{R}^{m \times d} \mid W W^\top = I_m\},$$

where $^\top$ denotes the transpose and $I_m$ is the $m$-dimensional identity matrix. Below, we assume that the reduced dimension $m$ is known.
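As a small illustration of the constraint $WW^\top = I_m$, the following sketch draws a matrix with orthonormal rows and projects one input vector. The function name `random_stiefel` and the use of a QR decomposition are illustrative assumptions, not part of the method described here.

```python
import numpy as np

def random_stiefel(m, d, seed=0):
    """Draw an m x d matrix W with orthonormal rows, i.e. W W^T = I_m."""
    rng = np.random.default_rng(seed)
    # The reduced QR of a d x m Gaussian matrix gives a d x m factor
    # with orthonormal columns; its transpose has orthonormal rows.
    q, _ = np.linalg.qr(rng.standard_normal((d, m)))
    return q.T

W = random_stiefel(m=2, d=5)
x = np.arange(5.0)   # one input vector in R^5
z = W @ x            # its low-dimensional representation z = W x
```

Any such matrix is a feasible point for the optimization problems considered below; only the row space of $W$ matters for sufficiency.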



1 y could be either continuous (i.e., regression) or cat- 
egorical (i.e., classification). Multi-dimensional outputs 
(e.g., multi-task regression and multi-label classification) 
and structured outputs (such as sequences, trees, and 
graphs) can also be handled in the proposed framework. 



2.2. Dependence Estimation-Maximization 
Framework 

Suzuki & Sugiyama (2010) showed that the optimal transformation matrix that leads to Eq.(1) can be characterized as

$$W^* = \operatorname*{argmax}_{W \in \mathbb{R}^{m \times d}} \mathrm{SMI}(Z, Y) \quad \text{s.t.}\ W W^\top = I_m. \qquad (2)$$

In the above, SMI(Z, Y) is the squared-loss mutual information:

$$\mathrm{SMI}(Z, Y) := \frac{1}{2} E_{p_z, p_y}\left[\left(\frac{p_{zy}(z, y)}{p_y(y)\, p_z(z)} - 1\right)^2\right],$$

where $E_{p_z, p_y}$ denotes the expectation over the marginals $p_z(z)$ and $p_y(y)$. Note that SMI is the Pearson divergence from $p_{zy}(z, y)$ to $p_z(z) p_y(y)$, while the ordinary mutual information is the Kullback-Leibler divergence from $p_{zy}(z, y)$ to $p_z(z) p_y(y)$. The Pearson divergence and the Kullback-Leibler divergence both belong to the class of $f$-divergences, which share similar theoretical properties. For example, SMI is non-negative and is zero if and only if Z and Y are statistically independent, as is ordinary mutual information.
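To make the definition concrete, the sketch below evaluates SMI for a pair of discrete variables, where the expectation over the product of marginals becomes a finite sum. The discrete setting and the function name `smi_discrete` are assumptions made for illustration only; the paper works with densities.

```python
import numpy as np

def smi_discrete(p_zy):
    """SMI for a joint probability table p_zy[i, j] = P(Z = i, Y = j).

    Assumes all marginal probabilities are strictly positive.
    """
    p_zy = np.asarray(p_zy, dtype=float)
    p_z = p_zy.sum(axis=1, keepdims=True)   # marginal of Z, shape (I, 1)
    p_y = p_zy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, J)
    prod = p_z * p_y                        # product of marginals
    ratio = p_zy / prod                     # density ratio
    # Half the expectation of (ratio - 1)^2 under the product of marginals.
    return 0.5 * np.sum(prod * (ratio - 1.0) ** 2)

independent = np.outer([0.3, 0.7], [0.4, 0.6])    # joint = product of marginals
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])    # Y is a copy of Z
smi_independent = smi_discrete(independent)
smi_dependent = smi_discrete(dependent)
```

The independent table gives zero, in line with the property stated above, while the fully dependent table gives a strictly positive value.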

Based on Eq.(2), we develop the following iterative 
algorithm for learning W: 

(i) Initialization: Initialize the transformation matrix $W$ (see Section 3.3).

(ii) Dependence estimation: For current $W$, an SMI estimator $\widehat{\mathrm{SMI}}$ is obtained (see Section 3.1).

(iii) Dependence maximization: Given an SMI estimator $\widehat{\mathrm{SMI}}$, its maximizer with respect to $W$ is obtained (see Section 3.2).

(iv) Convergence check: The above (ii) and (iii) are repeated until $W$ fulfills some convergence criterion$^2$.
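The control flow of steps (i) to (iv) can be sketched as a generic loop. Here `estimate_smi` and `maximize_smi` are hypothetical callables standing in for the procedures of Sections 3.1 and 3.2, and keeping the previous $W$ when an update lowers the estimate is an assumption of this sketch rather than something stated in the text.

```python
def estimation_maximization(W0, estimate_smi, maximize_smi, tol=1e-6, max_iter=100):
    """Alternate steps (ii) and (iii) until the SMI improvement drops below tol.

    estimate_smi(W) -> float and maximize_smi(W) -> new W are hypothetical
    placeholders for dependence estimation and dependence maximization.
    """
    W, smi = W0, estimate_smi(W0)
    for _ in range(max_iter):
        W_new = maximize_smi(W)          # step (iii): update W
        smi_new = estimate_smi(W_new)    # step (ii): re-estimate SMI at the new W
        improvement = smi_new - smi
        if improvement >= 0:
            W, smi = W_new, smi_new      # accept the update only if it did not hurt
        if improvement < tol:            # step (iv): convergence check
            break
    return W, smi

# Toy stand-ins: a concave score peaked at 3 and an update that halves the gap.
W_final, smi_final = estimation_maximization(
    0.0, lambda w: -(w - 3.0) ** 2, lambda w: w + 0.5 * (3.0 - w))
```

The default tolerance mirrors the threshold mentioned in the footnote on the convergence criterion.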

3. Proposed Method: Sufficient 
Component Analysis 

In this section, we describe our proposed method called 
the sufficient component analysis (SCA). 

3.1. Dependence Estimation 

In SCA, we utilize a non-parametric SMI estimator called least-squares mutual information (LSMI) (Suzuki et al., 2009), which was shown to achieve the optimal convergence rate (Suzuki & Sugiyama, 2010). Here, we review LSMI.

2 In experiments, we used the criterion that the improvement of $\widehat{\mathrm{SMI}}$ is less than $10^{-6}$.

3.1.1. Basic Idea 

A key idea of LSMI is to directly estimate the density ratio,

$$w(z, y) = \frac{p_{zy}(z, y)}{p_z(z)\, p_y(y)},$$

without going through density estimation of $p_{zy}(z, y)$, $p_z(z)$, and $p_y(y)$. Here, the density ratio function $w(z, y)$ is directly modeled by

$$w_\alpha(z, y) = \sum_{\ell=1}^{n} \alpha_\ell K(z, z_\ell) L(y, y_\ell), \qquad (3)$$

where $K(z, z')$ and $L(y, y')$ are kernel functions for $z$ and $y$, respectively.

Then, the parameter $\alpha = (\alpha_1, \ldots, \alpha_n)^\top$ is learned so that the following squared error is minimized:

$$J_0(\alpha) = \frac{1}{2} E_{p_z, p_y}\left[(w_\alpha(z, y) - w(z, y))^2\right].$$

$J_0$ can be expressed as

$$J_0(\alpha) = J(\alpha) + \mathrm{SMI}(Z, Y) + \frac{1}{2},$$

where

$$J(\alpha) = \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha,$$

$$H_{\ell, \ell'} = E_{p_z, p_y}\left[K(z, z_\ell) L(y, y_\ell) K(z, z_{\ell'}) L(y, y_{\ell'})\right],$$

$$h_\ell = E_{p_{zy}}\left[K(z, z_\ell) L(y, y_\ell)\right],$$

and SMI(Z, Y) is constant with respect to $\alpha$. Thus, minimizing $J_0$ is equivalent to minimizing $J$.

3.1.2. Computing the Solution 

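Approximating the expectations in $H$ and $h$ included in $J$ by empirical averages, we arrive at the following optimization problem:

$$\min_{\alpha}\left[\frac{1}{2} \alpha^\top \hat{H} \alpha - \hat{h}^\top \alpha + \frac{\lambda}{2} \alpha^\top R \alpha\right],$$

where a regularization term $\lambda \alpha^\top R \alpha / 2$ is included for avoiding overfitting, $\lambda\ (\ge 0)$ is a regularization parameter, $R$ is a regularization matrix, and, for $z_i = W x_i$,

$$\hat{H}_{\ell, \ell'} = \frac{1}{n^2} \sum_{i, j = 1}^{n} K(z_i, z_\ell) L(y_j, y_\ell) K(z_i, z_{\ell'}) L(y_j, y_{\ell'}),$$

$$\hat{h}_\ell = \frac{1}{n} \sum_{i = 1}^{n} K(z_i, z_\ell) L(y_i, y_\ell).$$

In $\hat{H}$, the index $i$ runs over the samples of $z$ and the index $j$ over the samples of $y$, so that the double sum is an empirical average over the product of the marginals, as required by the definition of $H$.

Differentiating the above objective function with respect to $\alpha$ and equating it to zero, we can obtain an analytic-form solution:

$$\hat{\alpha} = (\hat{H} + \lambda R)^{-1} \hat{h}. \qquad (4)$$

Based on the fact that SMI(Z, Y) is expressed as

$$\mathrm{SMI}(Z, Y) = \frac{1}{2} E_{p_{zy}}[w(z, y)] - \frac{1}{2},$$

the following SMI estimator can be obtained:

$$\widehat{\mathrm{SMI}} = \frac{1}{2} \hat{h}^\top \hat{\alpha} - \frac{1}{2}. \qquad (5)$$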
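A minimal NumPy sketch of the analytic solution (4) and the estimator (5) is given below. It assumes Gaussian kernels for both $z$ and $y$, uses all $n$ samples as kernel centres, takes $R$ to be the identity, and computes $\hat{H}$ as an average over all pairs $(z_i, y_j)$; the function names and default parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_gram(a, b, width):
    """Gram matrix G[i, l] = exp(-||a_i - b_l||^2 / (2 width^2))."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

def lsmi(z, y, width_z=1.0, width_y=1.0, lam=1e-3):
    """LSMI estimate of SMI(Z, Y); z and y are arrays of shape (n, dim)."""
    n = z.shape[0]
    Kz = gaussian_gram(z, z, width_z)    # Kz[i, l] = K(z_i, z_l)
    Ly = gaussian_gram(y, y, width_y)    # Ly[j, l] = L(y_j, y_l)
    # H[l, l'] = (1/n^2) sum_{i,j} K(z_i,z_l) K(z_i,z_l') L(y_j,y_l) L(y_j,y_l')
    H = (Kz.T @ Kz) * (Ly.T @ Ly) / n ** 2
    # h[l] = (1/n) sum_i K(z_i,z_l) L(y_i,y_l)
    h = (Kz * Ly).mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(n), h)   # Eq.(4) with R = I
    return 0.5 * h @ alpha - 0.5, alpha               # Eq.(5)

rng = np.random.default_rng(0)
z = rng.standard_normal((200, 1))
y_dep = z + 0.1 * rng.standard_normal((200, 1))   # strongly dependent on z
y_ind = rng.standard_normal((200, 1))             # independent of z
smi_dep, alpha_dep = lsmi(z, y_dep)
smi_ind, _ = lsmi(z, y_ind)
```

Because $\hat{H}$ factorizes into an elementwise product of two $n \times n$ matrices, the estimator costs one linear solve of size $n$ once the Gram matrices are available.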



3.1.3. Model Selection 






Hyper-parameters included in the kernel functions 
and the regularization parameter can be optimized by 
cross-validation with respect to J. 

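More specifically, the samples $\mathcal{Z} = \{(z_i, y_i)\}_{i=1}^{n}$ are divided into $K$ disjoint subsets $\{\mathcal{Z}_k\}_{k=1}^{K}$ of (approximately) the same size. Then, an estimator $\hat{\alpha}_{\mathcal{Z}_k}$ is obtained using $\mathcal{Z} \setminus \mathcal{Z}_k$ (i.e., all samples without $\mathcal{Z}_k$), and the approximation error for the hold-out samples $\mathcal{Z}_k$ is computed as

$$J^{(K\text{-CV})}_{\mathcal{Z}_k} = \frac{1}{2} \hat{\alpha}_{\mathcal{Z}_k}^\top \hat{H}_{\mathcal{Z}_k} \hat{\alpha}_{\mathcal{Z}_k} - \hat{h}_{\mathcal{Z}_k}^\top \hat{\alpha}_{\mathcal{Z}_k},$$

where, for $|\mathcal{Z}_k|$ being the number of samples in the subset $\mathcal{Z}_k$,

$$[\hat{H}_{\mathcal{Z}_k}]_{\ell, \ell'} = \frac{1}{|\mathcal{Z}_k|^2} \sum_{(z, y), (z', y') \in \mathcal{Z}_k} K(z, z_\ell) L(y', y_\ell) K(z, z_{\ell'}) L(y', y_{\ell'}),$$

$$[\hat{h}_{\mathcal{Z}_k}]_\ell = \frac{1}{|\mathcal{Z}_k|} \sum_{(z, y) \in \mathcal{Z}_k} K(z, z_\ell) L(y, y_\ell).$$

This procedure is repeated for $k = 1, \ldots, K$, and its average $J^{(K\text{-CV})}$ is outputted as

$$J^{(K\text{-CV})} = \frac{1}{K} \sum_{k=1}^{K} J^{(K\text{-CV})}_{\mathcal{Z}_k}.$$

We compute $J^{(K\text{-CV})}$ for all model candidates, and choose the model that minimizes $J^{(K\text{-CV})}$.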
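The cross-validation procedure can be sketched as follows. The sketch makes several simplifying assumptions that are not prescribed by the text: Gaussian kernels with one shared width for $z$ and $y$, kernel centres placed at the training samples of each fold, $R$ equal to the identity, and the hypothetical helper names `gram`, `h_and_H`, `cv_score`, and `select_model`.

```python
import numpy as np

def gram(a, b, width):
    """Gaussian Gram matrix between the rows of a and the centres in b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * width ** 2))

def h_and_H(z, y, cz, cy, width):
    """Empirical h and H over the samples (z, y) for centres (cz, cy)."""
    Kz, Ly = gram(z, cz, width), gram(y, cy, width)
    n = z.shape[0]
    H = (Kz.T @ Kz) * (Ly.T @ Ly) / n ** 2   # average over all pairs (z_i, y_j)
    h = (Kz * Ly).mean(axis=0)               # average over paired samples
    return h, H

def cv_score(z, y, width, lam, n_folds=5, seed=0):
    """K-fold estimate of J for one (width, lam) candidate."""
    n = z.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        cz, cy = z[train], y[train]          # centres: training samples (assumption)
        h_tr, H_tr = h_and_H(z[train], y[train], cz, cy, width)
        alpha = np.linalg.solve(H_tr + lam * np.eye(len(train)), h_tr)
        h_te, H_te = h_and_H(z[test], y[test], cz, cy, width)
        scores.append(0.5 * alpha @ H_te @ alpha - h_te @ alpha)  # hold-out J
    return float(np.mean(scores))

def select_model(z, y, widths, lams):
    """Return the (width, lam) pair with the smallest cross-validated J."""
    candidates = [(w, l) for w in widths for l in lams]
    return min(candidates, key=lambda c: cv_score(z, y, c[0], c[1]))

rng = np.random.default_rng(1)
z = rng.standard_normal((100, 1))
y = z + 0.3 * rng.standard_normal((100, 1))
best_width, best_lam = select_model(z, y, widths=[0.5, 1.0], lams=[1e-3, 1e-1])
```

Since every candidate only requires solving linear systems, the grid search stays inexpensive for moderate sample sizes.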

3.2. Dependence Maximization 

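Given an SMI estimator $\widehat{\mathrm{SMI}}$ (5), we next show how $\widehat{\mathrm{SMI}}$ can be efficiently maximized with respect to $W$:

$$\max_{W \in \mathbb{R}^{m \times d}} \widehat{\mathrm{SMI}} \quad \text{s.t.}\ W W^\top = I_m.$$

We propose to use a truncated negative quadratic function called the Epanechnikov kernel (Epanechnikov, 1969) as a kernel for $z$:

$$K(z, z_\ell) = \max\left(0,\ 1 - \frac{\|z - z_\ell\|^2}{2 \sigma_z^2}\right).$$

Let $I(c)$ be the indicator function, i.e., $I(c) = 1$ if $c$ is true and zero otherwise. Then, for the above kernel, $\widehat{\mathrm{SMI}}$ can be expressed as

$$\widehat{\mathrm{SMI}} = \frac{1}{2} \mathrm{tr}\left(W D W^\top\right) - \frac{1}{2},$$

where tr(A) is the trace of matrix A, and

$$D = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \hat{\alpha}_\ell(W)\, I\!\left(\frac{\|W x_i - W x_\ell\|^2}{2 \sigma_z^2} < 1\right) L(y_i, y_\ell) \left[\frac{1}{m} I_d - \frac{1}{2 \sigma_z^2} (x_i - x_\ell)(x_i - x_\ell)^\top\right].$$

Here, by $\hat{\alpha}_\ell(W)$, we explicitly indicated the fact that $\hat{\alpha}_\ell$ depends on $W$.

Let $D'$ be $D$ with $W$ replaced by $W'$, where $W'$ is a transformation matrix obtained in the previous iteration. Thus, $D'$ no longer depends on $W$. Here we replace $D$ in $\widehat{\mathrm{SMI}}$ by $D'$, which gives the following simplified SMI estimate:

$$\frac{1}{2} \mathrm{tr}\left(W D' W^\top\right) - \frac{1}{2}. \qquad (6)$$

A maximizer of Eq.(6) can be analytically obtained by $(w_1 | \cdots | w_m)^\top$, where $\{w_i\}_{i=1}^{m}$ are the $m$ principal components of $D'$.

3.3. Initialization of W

In the dependence estimation-maximization framework described in Section 2.2, initialization of the transformation matrix $W$ is important. Here we propose to initialize it based on dependence maximization without dimensionality reduction.

More specifically, we determine the initial transformation matrix as $(w_1^{(0)} | \cdots | w_m^{(0)})^\top$, where $\{w_i^{(0)}\}_{i=1}^{m}$ are the $m$ principal components of $D^{(0)}$:

$$D^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \hat{\alpha}_\ell^{(0)}\, I\!\left(\frac{\|x_i - x_\ell\|^2}{2 \sigma_x^2} < 1\right) L(y_i, y_\ell) \left[\frac{1}{m} I_d - \frac{1}{2 \sigma_x^2} (x_i - x_\ell)(x_i - x_\ell)^\top\right],$$

$$\hat{\alpha}^{(0)} = (\hat{H}^{(0)} + \lambda R)^{-1} \hat{h}^{(0)},$$

$$\hat{H}^{(0)}_{\ell, \ell'} = \frac{1}{n^2} \sum_{i, j = 1}^{n} K'(x_i, x_\ell) L(y_j, y_\ell) K'(x_i, x_{\ell'}) L(y_j, y_{\ell'}),$$

$$\hat{h}^{(0)}_\ell = \frac{1}{n} \sum_{i=1}^{n} K'(x_i, x_\ell) L(y_i, y_\ell),$$

$$K'(x, x_\ell) = \max\left(0,\ 1 - \frac{\|x - x_\ell\|^2}{2 \sigma_x^2}\right).$$

$\sigma_x$ is the kernel width and is chosen by cross-validation (see Section 3.1.3).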

4. Relation to Existing Methods 

Here, we review existing SDR methods and discuss the 
relation to the proposed SCA method. 

4.1. Kernel Dimension Reduction 

Kernel Dimension Reduction (KDR) (Fukumizu et al., 2009) tries to directly maximize the conditional independence of $x$ and $y$ given $z$ under a kernel-based independence measure.

The KDR learning criterion is given by

$$W^* = \operatorname*{argmax}_{W \in \mathbb{R}^{m \times d}} -\mathrm{tr}\left[\tilde{L} (\tilde{K} + n \epsilon I_n)^{-1}\right] \quad \text{s.t.}\ W W^\top = I_m, \qquad (7)$$

where $\tilde{L} = \Gamma L \Gamma$, $\Gamma = I_n - \frac{1}{n} 1_n 1_n^\top$, $L_{i,j} = L(y_i, y_j)$, $\tilde{K} = \Gamma K \Gamma$, $K_{i,j} = K(z_i, z_j)$, and $\epsilon$ is a regularization parameter.

Solving the above optimization problem is cumbersome since the objective function is non-convex. In the original KDR paper (Fukumizu et al., 2009), a gradient method is employed for finding a local optimal solution. However, the gradient-based optimization is computationally demanding due to its slow convergence and it requires many restarts for finding a good local optimum. Thus, KDR scales poorly to massive datasets.

Another critical weakness of KDR is the kernel function choice. The performance of KDR depends on the choice of kernel functions and the regularization parameter, but there is no systematic model selection method for KDR available. Using the Gaussian kernel with its width set to the median distance between samples is a standard heuristic in practice, but this does not always work very well.
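For reference, the median-distance heuristic mentioned above can be computed as in the following sketch; the function name `median_heuristic_width` is an illustrative choice.

```python
import numpy as np

def median_heuristic_width(x):
    """Median Euclidean distance over all pairs of distinct samples (rows of x)."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    upper = np.triu_indices(x.shape[0], k=1)   # each unordered pair once
    return float(np.median(np.sqrt(sq[upper])))

# Three one-dimensional samples with pairwise distances 1, 3, and 2.
width = median_heuristic_width(np.array([[0.0], [1.0], [3.0]]))
```

The heuristic fixes the width from the inputs alone, without reference to the outputs, which is one reason it may not suit every dataset.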

Furthermore, KDR lacks a good way to set an initial 
solution in the gradient procedure. Then, in practice, 
we need to run the algorithm many times with random 
initial points for finding good local optima. However, 
this makes the entire procedure even slower and the 
performance of dimension reduction unstable. 

The proposed SCA method can successfully overcome 
the above weaknesses of KDR — SCA is equipped with 
cross-validation for model selection (Section 3.1.3), 
its solution can be computed analytically (see Sec- 
tion 3.2), and a systematic initialization scheme is 
available (see Section 3.3). 

4.2. Least-Squares Dimensionality Reduction 

Least-squares dimension reduction (LSDR) is a re- 
cently proposed SDR method that can overcome the 
limitations of KDR (Suzuki & Sugiyama, 2010). That 
is, LSDR is equipped with a natural model selection 
procedure based on cross-validation. 

The proposed SCA can actually be regarded as a 
computationally efficient alternative to LSDR. In- 
deed, LSDR can also be interpreted as a dependence 
estimation-maximization algorithm (see Section 2.2), 
and the dependence estimation procedure is essentially 
the same as the proposed SCA, i.e., LSMI is used. The 
dependence maximization procedure is different from 
SCA — LSDR uses a natural gradient method (Amari, 
1998). 

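In LSDR, the following SMI estimator is used:

$$\widehat{\mathrm{SMI}} = \hat{\alpha}^\top \hat{h} - \frac{1}{2} \hat{\alpha}^\top \hat{H} \hat{\alpha} - \frac{1}{2},$$

where $\hat{\alpha}$, $\hat{h}$, and $\hat{H}$ are defined in Section 3.1. Then the gradient of $\widehat{\mathrm{SMI}}$ is given by

$$\frac{\partial \widehat{\mathrm{SMI}}}{\partial W_{l, l'}} = \frac{\partial \hat{h}^\top}{\partial W_{l, l'}} (2 \hat{\alpha} - \hat{\beta}) - \hat{\alpha}^\top \frac{\partial \hat{H}}{\partial W_{l, l'}} \left(\frac{3}{2} \hat{\alpha} - \hat{\beta}\right) + \lambda \hat{\alpha}^\top \frac{\partial R}{\partial W_{l, l'}} (\hat{\beta} - \hat{\alpha}),$$

where $\hat{\beta} = (\hat{H} + \lambda R)^{-1} \hat{H} \hat{\alpha}$. The natural gradient update of $W$, which takes into account the structure of the Stiefel manifold (Amari, 1998), is given by

$$W \leftarrow W \exp\left(\eta \left(W^\top \frac{\partial \widehat{\mathrm{SMI}}}{\partial W} - \frac{\partial \widehat{\mathrm{SMI}}}{\partial W}^\top W\right)\right),$$

where 'exp' for a matrix denotes the matrix exponential. $\eta > 0$ is a step size, which may be optimized by a line-search method such as Armijo's rule (Patriksson, 1999).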
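The following sketch illustrates why an update of this form keeps $W$ on the Stiefel manifold: the matrix inside the exponential is skew-symmetric, so its exponential is orthogonal. The gradient `G` is a random placeholder rather than the LSDR gradient, and computing the exponential through a Hermitian eigendecomposition is an implementation choice of this sketch.

```python
import numpy as np

def expm_skew(A):
    """Matrix exponential of a real skew-symmetric matrix A."""
    # 1j * A is Hermitian when A is real and skew-symmetric.
    lam, V = np.linalg.eigh(1j * A)
    # A = -1j * V diag(lam) V^H, hence exp(A) = V diag(exp(-1j * lam)) V^H.
    return (V * np.exp(-1j * lam)) @ V.conj().T

def stiefel_step(W, G, eta):
    """One update W <- W exp(eta (W^T G - G^T W)) for a gradient G of W's shape."""
    A = W.T @ G - G.T @ W                 # d x d skew-symmetric matrix
    return W @ expm_skew(eta * A).real    # imaginary part is round-off only

rng = np.random.default_rng(0)
W = np.linalg.qr(rng.standard_normal((5, 2)))[0].T   # 2 x 5, orthonormal rows
G = rng.standard_normal((2, 5))                      # placeholder gradient
W_next = stiefel_step(W, G, eta=0.1)
```

Each such step requires a $d \times d$ matrix exponential in addition to the gradient itself, which is part of the cost that the analytic update of SCA avoids.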

Since cross-validation is available for model selection of LSMI, LSDR is more favorable than KDR. However, its optimization still relies on a gradient-based method and thus it is computationally expensive.

Furthermore, there seems to be no good initialization scheme for the transformation matrix $W$. In the original paper by Suzuki & Sugiyama (2010), initial values were chosen randomly and the gradient method was run many times for finding a better local solution.

The proposed SCA method can successfully over- 
come the above weaknesses of LSDR, by providing an 
analytic-form solution (see Section 3.2) and a system- 
atic initialization scheme (see Section 3.3). 

5. Experiments 

In this section, we experimentally investigate the performance of the proposed and existing SDR methods using artificial and real-world datasets.

5.1. Artificial Datasets

We use four artificial datasets, and compare the proposed SCA, LSDR$^1$ (Suzuki & Sugiyama, 2010), KDR$^2$ (Fukumizu et al., 2009), sliced inverse regression (SIR)$^3$ (Li, 1991), sliced average variance estimation (SAVE)$^3$ (Cook, 2000), and principal Hessian direction (pHd)$^3$ (Li, 1992).

In SCA, we use the Gaussian kernel for $y$:

$$L(y, y_\ell) = \exp\left(-\frac{\|y - y_\ell\|^2}{2 \sigma_y^2}\right).$$

The identity matrix is used as regularization matrix $R$, and the kernel widths $\sigma_x$, $\sigma_y$, and $\sigma_z$ as well as the regularization parameter $\lambda$ are chosen based on 5-fold cross-validation.

The performance of each method is measured by

$$\frac{1}{\sqrt{2m}} \left\|\hat{W}^\top \hat{W} - W^{*\top} W^*\right\|_{\mathrm{Frobenius}},$$

where $\|\cdot\|_{\mathrm{Frobenius}}$ denotes the Frobenius norm, $\hat{W}$ is an estimated transformation matrix, and $W^*$ is the optimal transformation matrix. Note that the above error measure takes its value in [0, 1].

1 http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDR/index.html

2 We used the program code provided by one of the authors of Fukumizu et al. (2009), which 'anneals' the Gaussian kernel width over gradient iterations.

3 http://mirrors.dotsrc.org/cran/web/packages/dr/index.html

Figure 1. Artificial datasets.

Table 1. Mean of Frobenius-norm error (with standard deviations in brackets) and mean CPU time over 100 trials. Computation time is normalized so that LSDR is one. LSDR was repeated 5 times with random initialization and the transformation matrix with the minimum CV score was chosen as the final solution. 'SCA(0)' indicates the performance of the initial transformation matrix obtained by the method described in Section 3.3. The best method in terms of the mean Frobenius-norm and comparable methods according to the t-test at the significance level 1% are specified by bold face.

Datasets  d   m   SCA(0)       SCA          LSDR         KDR          SIR          SAVE         pHd
Data1     4   1   .089 (.042)  .048 (.031)  .056 (.021)  .048 (.019)  .257 (.168)  .339 (.218)  .593 (.210)
Data2     10  1   .078 (.019)  .007 (.002)  .039 (.023)  .024 (.007)  .431 (.281)  .348 (.206)  .443 (.222)
Data3     4   2   .065 (.035)  .018 (.010)  .090 (.069)  .029 (.119)  .362 (.182)  .343 (.213)  .437 (.231)
Data4     5   1   .118 (.046)  .042 (.030)  .151 (.296)  .118 (.238)  .421 (.268)  .356 (.197)  .591 (.205)
Time              0.03         0.49         1.0          0.96         <0.01        <0.01        <0.01
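The error measure compares the projection matrices onto the estimated and optimal subspaces, so it is invariant to sign flips and rotations within the subspace. The sketch below assumes the normalization $1/\sqrt{2m}$, which is what makes the value lie in [0, 1] for matrices with orthonormal rows; the function name `subspace_error` is illustrative.

```python
import numpy as np

def subspace_error(W_hat, W_star):
    """Normalized Frobenius distance between the two projection matrices."""
    m = W_star.shape[0]
    diff = W_hat.T @ W_hat - W_star.T @ W_star
    return np.linalg.norm(diff, 'fro') / np.sqrt(2 * m)

W_star = np.array([[1.0, 0.0, 0.0]])
err_same = subspace_error(-W_star, W_star)                        # same subspace, flipped sign
err_orth = subspace_error(np.array([[0.0, 1.0, 0.0]]), W_star)    # orthogonal subspace
```

Identical subspaces give zero error and mutually orthogonal subspaces give the maximum value of one.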

We use the following four datasets (see Figure 1):

(a) Data1:

$$Y = X_2 + 0.5 E,$$

where $(X_1, \ldots, X_4)^\top \sim U([-1, 1]^4)$ and $E \sim N(0, 1)$. Here, $U(S)$ denotes the uniform distribution on $S$, and $N(\mu, \Sigma)$ is the Gaussian distribution with mean $\mu$ and variance $\Sigma$.



(b) Data2: 



Y = (X,) 2 + 0.1E, 



where $(X_1, \ldots, X_{10})^\top \sim N(0_{10}, I_{10})$ and $E \sim N(0, 1)$.



(c) Data3:

$$Y = \frac{X_1}{0.5 + (X_2 + 1.5)^2} + (1 + X_2)^2 + 0.1 E,$$

where $(X_1, \ldots, X_4)^\top \sim N(0_4, I_4)$ and $E \sim N(0, 1)$.



(d) Data4:

$$Y \mid X_2 \sim \begin{cases} N(0, 0.2) & \text{if } X_2 < |1/6|, \\ 0.5 N(1, 0.2) + 0.5 N(-1, 0.2) & \text{otherwise}, \end{cases}$$

where $(X_1, \ldots, X_5)^\top \sim U([-0.5, 0.5]^5)$ and $E \sim N(0, 1)$.

The performance of each method is summarized in Ta- 
ble 1, which depicts the mean and standard deviation 
of the Frobenius-norm error over 100 trials when the 
number of samples is n = 1000. As can be observed, 
the proposed SCA overall performs well. 'SCA(0)' in
the table indicates the performance of the initial trans- 
formation matrix obtained by the method described 
in Section 3.3. The result shows that SCA(0) gives 
a reasonably good transformation matrix with a tiny 
computational cost. Note that KDR and LSDR have 
high standard deviation for Data3 and Data4, meaning 
that KDR and LSDR sometimes perform poorly. 

5.2. Multi-label Classification for Real-world 

Datasets 

Finally, we evaluate the performance of the proposed 
method in real-world multi-label classification prob- 
lems. 

5.2.1. Setup 

Below, we compare SCA, Multi-label Dimensionality reduction via Dependence Maximization (MDDM)$^4$ (Zhang & Zhou, 2010), Canonical Correlation Analysis (CCA)$^5$ (Hotelling, 1936), and Principal Component Analysis (PCA)$^6$ (Bishop, 2006). We use a real-world image classification dataset called the PASCAL Visual Object Classes (VOC) 2010 dataset (Everingham et al., 2010) and a real-world automatic audio-tagging dataset called the Freesound dataset (The Freesound Project, 2011). Since the computational costs of KDR and LSDR were unbearably large, we decided not to include them in the comparison.

4 http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/annex/MDDM.htm

We employ the misclassification rate by the nearest-neighbor classifier as a performance measure:

$$\mathrm{err} = \frac{1}{nc} \sum_{i=1}^{n} \sum_{k=1}^{c} I(\hat{y}_{i,k} \neq y_{i,k}),$$

where $c$ is the number of classes, $\hat{y}$ and $y$ are the estimated and true labels, and $I(\hat{y} \neq y)$ is the indicator function.
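As a brief illustration, the multi-label misclassification rate is the fraction of label entries on which prediction and ground truth disagree. The sketch assumes labels are stored as $n \times c$ binary arrays and uses the illustrative name `multilabel_error`.

```python
import numpy as np

def multilabel_error(y_pred, y_true):
    """Fraction of (sample, class) entries where the predicted label is wrong."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

# Two samples, two classes: one of the four label entries is wrong.
err = multilabel_error([[1, 0], [0, 1]], [[1, 1], [0, 1]])
```

Averaging over both samples and classes keeps the measure comparable across datasets with different numbers of tags.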

For SCA and MDDM, we use the following kernel func- 
tion (Sarwar et al., 2001) for y: 



where $\bar{y}$ is the sample mean: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.

5.2.2. PASCAL VOC 2010 Dataset 

The VOC 2010 dataset consists of 20 binary classification tasks of identifying the existence of a person, aeroplane, etc. in each image. The total number of images in the dataset is 11319, and we used 1000 randomly chosen images for training and the rest for testing.

In this experiment, we first extracted visual features from each image using the Speeded Up Robust Features (SURF) algorithm (Bay et al., 2008), and obtained 500 visual words as the cluster centers in the SURF space. Then, we computed a 500-dimensional bag-of-feature vector by counting the number of visual words in each image. We randomly sampled the training and test data 100 times, and computed the means and standard deviations of the classification error.
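The bag-of-feature construction can be sketched as follows: each local descriptor is assigned to its nearest cluster center (visual word), and the occurrences of each word are counted. Nearest-center assignment under the Euclidean distance and the function name `bag_of_features` are assumptions of this sketch; the descriptor extraction and clustering steps are not shown.

```python
import numpy as np

def bag_of_features(descriptors, centers):
    """Histogram of nearest-center assignments for one image's descriptors."""
    sq = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = sq.argmin(axis=1)                        # index of the closest center
    return np.bincount(words, minlength=centers.shape[0])

centers = np.array([[0.0, 0.0], [10.0, 10.0]])                 # two toy visual words
descriptors = np.array([[0.0, 1.0], [9.0, 9.0], [1.0, 0.0]])   # three toy descriptors
histogram = bag_of_features(descriptors, centers)
```

With 500 centers, the same routine yields the 500-dimensional vectors used as inputs in this experiment.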

The results are plotted in Figure 2(a), showing that SCA outperforms the existing methods, and SCA is the only method that outperforms 'ORI' (no dimension reduction): SCA achieves almost the same error rate as 'ORI' with only a 10-dimensional subspace.

5.2.3. Freesound Dataset 

The Freesound dataset (The Freesound Project, 2011) consists of various audio files annotated with word tags such as 'people', 'noisy', and 'restaurant'. We used 230 tags in this experiment. The total number of audio files in the dataset is 5905, and we used 1000 randomly chosen audio files for training and the rest for testing.

5 http://www.mathworks.com/help/toolbox/stats/canoncorr.html

6 http://www.mathworks.com/help/toolbox/stats/princomp.html

We first extracted Mel-Frequency Cepstrum Coefficients (MFCC) (Rabiner & Juang, 1993) from each audio file, and obtained 1024 audio features as the cluster centers in MFCC. Then, we computed a 1024-dimensional bag-of-feature vector by counting the number of audio features in each audio file. We randomly chose the training and test samples 100 times, and computed the means and standard deviations of the classification error.

The results plotted in Figure 2(b) show that, similarly 
to the image classification task, the proposed SCA out- 
performs the existing methods, and SCA is the only 
method that outperforms 'ORI'.

6. Conclusion 

In this paper, we proposed a novel sufficient dimension 
reduction (SDR) method called sufficient component 
analysis (SCA), which is computationally more effi- 
cient than existing SDR methods. In SCA, a transfor- 
mation matrix was estimated by iteratively perform- 
ing dependence estimation and maximization, both of 
which are analytically carried out. Moreover, we developed a systematic method to design a good initial transformation matrix, which contributes substantially to further reducing the computational cost and helps obtain a good local optimum solution. We applied the
proposed SCA to real-world image classification and 
audio tagging tasks, and experimentally showed that 
the proposed method is promising. 

Acknowledgments 

The authors thank Prof. Kenji Fukumizu for providing us the KDR code and Prof. Taiji Suzuki for his
valuable comments. MY was supported by the JST 
PRESTO program. GN was supported by the MEXT 
scholarship. MS was supported by SCAT, AOARD, 
and the JST PRESTO program. 

References 

Amari, S. Natural gradient works efficiently in learn- 
ing. Neural Computation, 10:251-276, 1998. 

Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. Surf: 
Speeded up robust features. Computer Vision and 
Image Understanding, 110(3):346-359, 2008. 

Bishop, C. M. Pattern Recognition and Machine 



[Figure 2: two panels plotting the misclassification rate against the number of reduced dimensions for SCA, MDDM, CCA, PCA, and ORI. (a) VOC 2010 dataset. (b) Freesound dataset.]

Figure 2. Results on image classification with the VOC 2010 dataset and audio classification with the Freesound dataset. The misclassification rate of the one-nearest-neighbor classifier is reported. The best dimension reduction method in terms of the mean error and comparable methods according to the t-test at the significance level 1% are specified by 'o'. CCA can be applied to dimension reduction up to c dimensions, where c is the number of classes (c = 20 in VOC 2010 and c = 230 in Freesound). 'ORI' denotes the original data without dimension reduction.